(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 #2913
Conversation
FLUX.2-dev ModelOpt FP8 script:

BLOCKING: ModelOpt FP8 checkpoints should work in both offline and online serving.
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
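The threading pattern described above can be sketched with plain stand-in classes (names are hypothetical, not the actual HunyuanVideo15 modules): each wrapper forwards `quant_config` and a dotted `prefix` down to its children, so the ModelOpt adapter can resolve per-layer scales by name at load time, while modulation deliberately stays full precision.

```python
class Linear:
    """Stand-in for a vLLM quantizable linear layer (hypothetical)."""
    def __init__(self, in_f, out_f, quant_config=None, prefix=""):
        self.in_f, self.out_f = in_f, out_f
        self.quant_config = quant_config  # None => raw full-precision linear
        self.prefix = prefix              # dotted name used to bind per-layer scales

class Attention:
    """Stand-in for HunyuanVideo15Attention: forwards quant_config/prefix down."""
    def __init__(self, dim, quant_config=None, prefix=""):
        self.qkv = Linear(dim, 3 * dim, quant_config, f"{prefix}.qkv")
        self.out = Linear(dim, dim, quant_config, f"{prefix}.out")

class TransformerBlock:
    """Stand-in for HunyuanVideo15TransformerBlock."""
    def __init__(self, dim, quant_config=None, prefix=""):
        self.attn = Attention(dim, quant_config, f"{prefix}.attn")
        # Modulation stays raw: quant_config is deliberately NOT threaded through.
        self.modulation = Linear(dim, 6 * dim)

blocks = [
    TransformerBlock(64, quant_config={"quant_algo": "FP8"}, prefix=f"blocks.{i}")
    for i in range(2)
]
print(blocks[1].attn.qkv.prefix)  # blocks.1.attn.qkv
```

Because the prefix matches the checkpoint's parameter naming, the adapter from vllm-project#2913 can look up `blocks.1.attn.qkv`'s scale tensors directly.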
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, and disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8; see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
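The layer-skipping described above boils down to name-pattern matching; a minimal sketch (the pattern strings here are hypothetical illustrations, not the helper's actual skip list, which feeds into ModelOpt's quant config):

```python
import fnmatch

# Hypothetical skip patterns mirroring the commit message: modulation,
# embeddings, output proj, and token refiner stay full precision.
SKIP_PATTERNS = [
    "*modulation*",
    "*embed*",
    "*proj_out*",
    "*token_refiner*",
]

def should_quantize(name: str) -> bool:
    """Quantize a linear only if no precision-sensitive pattern matches its name."""
    return not any(fnmatch.fnmatch(name, p) for p in SKIP_PATTERNS)

layers = [
    "blocks.0.attn.qkv",
    "blocks.0.modulation.linear",
    "time_embed.linear_1",
    "proj_out",
]
quantized = [n for n in layers if should_quantize(n)]
print(quantized)  # ['blocks.0.attn.qkv']
```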
prompt: https://paste.ubuntu.com/p/ypkqDtNxQN/
The default export_hf_checkpoint() doesn't actually serialize weights as FP8 for unknown model types like HunyuanVideo15Transformer3DModel; it saves BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug. Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight per module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy the source minus transformer/, then save the quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
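The config.json patch step might look like the sketch below. The schema is an assumption modeled on ModelOpt's exported metadata: only quant_algo and config_groups are named in the commit, the remaining field names are illustrative.

```python
import json
import pathlib
import tempfile

def patch_config(config_path: pathlib.Path) -> dict:
    """Inject ModelOpt-style FP8 metadata so the serving adapter auto-detects it."""
    cfg = json.loads(config_path.read_text())
    cfg["quantization_config"] = {
        "quant_algo": "FP8",  # the key the adapter keys off
        "config_groups": {
            "group_0": {
                "weights": {"type": "float", "num_bits": 8, "strategy": "tensor"},
                "input_activations": {"type": "float", "num_bits": 8},
            }
        },
    }
    config_path.write_text(json.dumps(cfg, indent=2))
    return cfg

# Demonstrate on a throwaway config file.
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "config.json"
    p.write_text(json.dumps({"_class_name": "HunyuanVideo15Transformer3DModel"}))
    patched = patch_config(p)
print(patched["quantization_config"]["quant_algo"])  # FP8
```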
…block
When --weight-block-size 'M,N' is given, override the weight quantizer with block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor instead of a scalar. The patched config_groups advertises strategy='block' + block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this just works at serving time via vllm-project#2913's adapter.

Default behavior is unchanged (per-tensor); pass --weight-block-size 128,128 to opt in.

Signed-off-by: lishunyang <lishunyang12@163.com>
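The shape math for the per-block scales can be sketched in numpy. This illustrates the (out//M, in//N) layout only, not ModelOpt's actual quantizer; 448 as the FP8 e4m3 max magnitude is an assumption about the target format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable e4m3 magnitude (assumed target format)

def block_scales(weight: np.ndarray, M: int, N: int) -> np.ndarray:
    """One scale per MxN tile, i.e. block_sizes={-2: M, -1: N} -> (out//M, in//N)."""
    out_f, in_f = weight.shape
    assert out_f % M == 0 and in_f % N == 0
    # Reshape into (row-blocks, M, col-blocks, N) tiles, take per-tile amax.
    tiles = weight.reshape(out_f // M, M, in_f // N, N)
    amax = np.abs(tiles).max(axis=(1, 3))  # shape (out//M, in//N)
    return amax / FP8_E4M3_MAX

w = np.random.randn(256, 512).astype(np.float32)
scales = block_scales(w, 128, 128)
print(scales.shape)  # (2, 4)
```

With --weight-block-size 128,128 a (256, 512) linear therefore carries a (2, 4) scale tensor instead of a single scalar.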
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
Z-Image: offline
Force-pushed a9b3165 to 263be06.
For the e2e test:
We should have a unified model weight conversion script, like those in vllm-omni/vllm_omni/quantization/tools, plus a compare_diffusion_trajectory_similarity script. WDYT @baonudesifeizhai @lishunyang12
Quality outputs look good, but we have no perf numbers for any of the 5 models. Can you share:

I want to validate the perf story before merging.
After force_kernel=PerTensorTorchFP8ScaledMMLinearKernel on the vLLM side ...
https://paste.ubuntu.com/p/92yBc9x7bB/
Force-pushed a689e90 to 22fbfd5.
tests/diffusion/quantization/test_quantization_quality.py::test_quantization_quality[qwen_image_2512_modelopt_fp8_dynamic_all] PASSED [100%]
```diff
  self.shared_experts = None
- self.experts = SharedFusedMoE(
+ self.experts = FusedMoE(
```

We haven't validated this model yet, so we won't modify it for now.
```diff
  from vllm.inputs import MultiModalDataDict
  from vllm.logger import init_logger
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
```

Why should this be changed to FusedMoE?
```diff
  from vllm.entrypoints.pooling.embed.serving import ServingEmbedding as OpenAIServingEmbedding
- from vllm.entrypoints.pooling.pooling.serving import OpenAIServingPooling
+ from vllm.entrypoints.pooling.pooling.serving import ServingPooling as OpenAIServingPooling
  from vllm.entrypoints.pooling.score.serving import ServingScores
```

Why should we modify it here?
```diff
  import vllm.forward_context as _vllm_fc
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
```
```python
return rotary_position_embedding(x, cos, sin, rotated_mode="rotated_half", head_first=False, fused=True)
```

```python
def _ensure_batch_dim(x: torch.Tensor) -> tuple[torch.Tensor, bool]:
```

Why should we modify it here?
```python
quantization="fp8",
task="t2i",
prompt="a cup of coffee on a wooden table, morning light",
max_lpips=0.35,
```

I think the max_lpips threshold was set too arbitrarily before, and one metric isn't enough; we need to add metrics like PSNR or MAE to monitor it. I believe we should define this threshold properly first.
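PSNR and MAE are cheap to compute alongside LPIPS; a minimal sketch over uint8 images (not the repo's actual test utilities):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; inf when images are identical."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

def mae(ref: np.ndarray, test: np.ndarray) -> float:
    """Mean absolute error in raw pixel units."""
    return float(np.mean(np.abs(ref.astype(np.float64) - test.astype(np.float64))))

ref = np.zeros((4, 4), dtype=np.uint8)
test = ref + 2  # uniform pixel error of 2
print(round(psnr(ref, test), 2), mae(ref, test))  # 42.11 2.0
```

Unlike LPIPS, both are distribution-free, so their gating thresholds are easier to reason about when calibrating a quality test.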
```diff
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: Apache-2.0
```

I think this test should include accuracy-related tests; simply testing functionality is meaningless.
CLI:

```bash
python text_to_image.py --model <your-model> --quantization fp8
```

We should add modelopt.md and .nav.yml, such as https://docs.vllm.ai/en/latest/features/quantization/modelopt/, and follow the vllm-omni quantization style.
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
LGTM now. @lishunyang12 PTAL, thx
We may need to add a modelopt quantization script tool later. Thank you for your contribution.
vllm-project#2709 (vllm-project#2913) Signed-off-by: roG0d <rodgarcas98@gmail.com> Signed-off-by: roG0d <baonudesifeizhai@gmail.com> Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com> Co-authored-by: roG0d <rodgarcas98@gmail.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com> Signed-off-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>









PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
#2709
This PR adds Phase 1 support for ModelOpt FP8 diffusion checkpoints.
- Auto-detect `quantization_config` from diffusion checkpoint configs.
- Upgrade `fp8` stage configs to checkpoint-specific ModelOpt FP8 when serialized ModelOpt metadata is present.

Validation
Validated ModelOpt FP8 image generation on:
Benchmark Setup
All results below use the following settings unless otherwise noted:
- num-prompts=100
- request-rate=inf
- warmup-requests=0
- width=1024
- height=1024
- num-inference-steps=20
- seed=42

For online serving benchmarks, we use:
- max-concurrency=32

BF16 vs ModelOpt FP8
Offline vs Online
Observations
- HunyuanImage3, Qwen-Image-2512, Z-Image, and FLUX.2-klein-4B all show consistent gains from ModelOpt FP8 in both offline and online settings.
- FLUX.2-dev is the main exception in this set: ModelOpt FP8 reduces peak memory, but both offline and online throughput regress relative to BF16.
- The largest gains are on HunyuanImage3, with roughly a 21% throughput gain and a 16% mean latency reduction.

TODO
- FLUX.2-dev
- 'qwen image' latency
- BF16 activation → FP8 activation quantization → FP8 GEMM → BF16 output
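The activation dataflow in the last TODO item can be simulated per-tensor in numpy. This sketch emulates only the scaling round-trip (quantize to the FP8 range, GEMM in the scaled domain, rescale back); it does not model e4m3 rounding or the actual FP8 GEMM kernel, and 448 as the e4m3 max is an assumption about the format.

```python
import numpy as np

FP8_MAX = 448.0  # e4m3 max magnitude (assumed)

def quantize(x: np.ndarray):
    """Per-tensor scale into the FP8 dynamic range; e4m3 rounding omitted."""
    scale = np.abs(x).max() / FP8_MAX
    q = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    return q, scale

# BF16 activation -> FP8 activation quantization -> FP8 GEMM -> BF16 output
a = np.random.randn(8, 16).astype(np.float32)  # activation (stands in for BF16)
w = np.random.randn(16, 4).astype(np.float32)  # pre-quantized weight
qa, sa = quantize(a)
qw, sw = quantize(w)
out = (qa @ qw) * (sa * sw)  # GEMM in the scaled domain, then rescale back
print(np.allclose(out, a @ w, atol=1e-3))  # True (no rounding simulated)
```

In the real path the rounding step introduces the quantization error; this round-trip only shows where the scales enter and leave the GEMM.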
--
Test Plan
ModelOpt FP8 for qwen-image:
https://paste.ubuntu.com/p/gby859n2Qt/
hunyuan ModelOpt FP8: https://paste.ubuntu.com/p/dTgpmNzw3K/
```bash
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_steps20.log
```
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.